Although the implementation in the previous post did end with a small result, it was honestly a rush job.
The main reason, of course, is...
Just getting the earlier pieces (the API) working and wrapping my head around TFRecord already ate up a ton of time QQ
Excuses! All of it, excuses!
Right! Excuses, which is exactly why I can justifiably paste the whole process here and call it today's post (confetti~
That said, diligent as I am, I still tweaked some of the code and carefully re-implemented the whole thing!!!
So, a quick rundown of what's below!
Purely as a test, I stacked two LSTM+Dropout blocks and a final LSTM layer,
followed by three fully-connected (Dense) layers:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
lstm (LSTM)                  (None, 90, 50)            10400
_________________________________________________________________
dropout (Dropout)            (None, 90, 50)            0
_________________________________________________________________
lstm_1 (LSTM)                (None, 90, 50)            20200
_________________________________________________________________
dropout_1 (Dropout)          (None, 90, 50)            0
_________________________________________________________________
lstm_2 (LSTM)                (None, 30)                9720
_________________________________________________________________
dense (Dense)                (None, 40)                1240
_________________________________________________________________
dense_1 (Dense)              (None, 20)                820
_________________________________________________________________
dense_2 (Dense)              (None, 10)                210
=================================================================
Total params: 42,590
Trainable params: 42,590
Non-trainable params: 0
_________________________________________________________________
Continuing with the daily-historical-stock-prices-1970-2018 dataset from the previous post.
Before using TFRecord, I first converted the .CSV files into .npy files, which saved roughly 45% of the space (2 GB -> 1.13 GB):
import numpy as np
import pandas as pd

# dataset_path and dataset_dict come from the previous post's setup
names = pd.read_csv(dataset_path + dataset_dict['names'], engine='python')
companies = names.ticker.unique()   # unique ticker symbols
np.save('companies.npy', companies)

prices = pd.read_csv(dataset_path + dataset_dict['prices'], engine='python')
prices = prices.values              # DataFrame -> plain ndarray
np.save('dataGen_full.npy', prices)
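If you want to double-check that saving yourself, a quick sketch along these lines should do it (it assumes the same dataset_path / dataset_dict and file names as above):

import os

csv_size = os.path.getsize(dataset_path + dataset_dict['prices'])
npy_size = os.path.getsize('dataGen_full.npy')
print('CSV : %.2f GB' % (csv_size / 1024 ** 3))
print('NPY : %.2f GB' % (npy_size / 1024 ** 3))
print('saved: %.1f%%' % ((1 - npy_size / csv_size) * 100))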
import os
import tensorflow as tf
import numpy as np

companies = np.load('companies.npy', allow_pickle=True)
prices = np.load('dataGen_full.npy', allow_pickle=True)

# Helpers that wrap raw values into tf.train.Feature protos
def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=value))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

def _float32_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))

def create_tfrecords(writer, days, stock_sc):
    # Sliding window: each example packs `days` past prices as x
    # and the following 10 prices as y, both serialized as raw bytes.
    stock_sc = np.asarray(stock_sc, np.float32)
    max_days = len(stock_sc) - 10
    for i in range(days, max_days):
        x = stock_sc[i - days:i, 0]
        y = stock_sc[i:i + 10, 0]
        feature = {
            'x': _bytes_feature([x.tostring()]),
            'y': _bytes_feature([y.tostring()])}
        example = tf.train.Example(features=tf.train.Features(feature=feature))
        writer.write(example.SerializeToString())
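To make the sliding window concrete, here is a small sanity check I'd run (the 400-point dummy series and the dummy.tfrecord file name are made up for illustration): with days=90, a series of length 400 yields examples for i = 90..389, i.e. exactly 300 (x, y) pairs.

dummy = np.arange(400, dtype=np.float32).reshape(-1, 1)
with tf.python_io.TFRecordWriter('dummy.tfrecord') as w:
    create_tfrecords(writer=w, days=90, stock_sc=dummy)
n = sum(1 for _ in tf.python_io.tf_record_iterator('dummy.tfrecord'))
print(n)  # 300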
# Data Normalization
from sklearn.preprocessing import MinMaxScaler

# Randomly pick 20 companies out of the full list
num_choose = 20
count = len(companies)
rand_company_indexes = np.random.randint(count, size=num_choose)
companies_choose = companies[rand_company_indexes]

# columns after dropping ticker: open close adj_close low high volume
scaler = MinMaxScaler()
days_before = 90
save_name = './training.tfrecord'  # read back as training.tfrecord below
with tf.python_io.TFRecordWriter(save_name) as writer:
    for company in companies_choose:
        indexes = prices[:, 0] == company
        stock = prices[indexes, 1:]
        training_data = stock[:, 2:3]  # keep only the adj_close column
        if len(training_data) < 100:   # skip stocks with too little history
            continue
        # scale each company's prices into [0, 1] independently
        get_sc = scaler.fit_transform(training_data)
        create_tfrecords(writer=writer, days=days_before, stock_sc=get_sc)
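One caveat worth noting (my own note, not a step from the original flow): since the scaler is re-fit for every company, each stock lands in [0, 1] on its own scale, so to read predictions back as real prices you'd have to keep that company's scaler around and call inverse_transform. A minimal round-trip sketch with made-up prices:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

sc = MinMaxScaler()
prices_demo = np.array([[10.0], [12.0], [11.0], [15.0]])  # made-up prices
scaled = sc.fit_transform(prices_demo)     # squashed into [0, 1]
restored = sc.inverse_transform(scaled)    # back to the original scale
print(np.allclose(prices_demo, restored))  # True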
import tensorflow as tf

# Walk through the file once: print the first record to eyeball the
# stored features, and count every record along the way
file = "training.tfrecord"
count = 0
for string_record in tf.python_io.tf_record_iterator(path=file):
    if count == 0:
        # Print the first example only; this part is purely demonstrative.
        example = tf.train.Example()
        example.ParseFromString(string_record)
        print(example)
    count += 1
print('total of count : ', count)
Feeding the training data through the tf.data API. For the initial settings I set the batch size to 3000, purely copying what I'd seen others use.
The previous post only used 200, which felt nowhere near enough for the model to learn anything... (pure gut feeling
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import os

reshape_size = 90   # days of history per example
length = 352983     # total number of examples in the TFRecord
batch_size = 3000

def extract_features(example, reshape_size):
    # Parse one serialized tf.train.Example back into (stock, label) tensors
    features = tf.parse_single_example(
        example,
        features={
            'x': tf.FixedLenFeature([], tf.string),
            'y': tf.FixedLenFeature([], tf.string),
        }
    )
    stock = tf.decode_raw(features['x'], tf.float32)
    stock = tf.reshape(stock, [reshape_size])
    stock = tf.cast(stock, tf.float32)
    stock = tf.expand_dims(stock, -1)   # (90,) -> (90, 1) for the LSTM
    label = tf.decode_raw(features['y'], tf.float32)
    label = tf.reshape(label, [10])
    label = tf.cast(label, tf.float32)
    return stock, label

tfrecords_path = './training.tfrecord'
dataset = tf.data.TFRecordDataset(tfrecords_path)
dataset = dataset.map(lambda x: extract_features(x, reshape_size))
dataset = dataset.shuffle(buffer_size=1000)
dataset = dataset.batch(batch_size, drop_remainder=True)
dataset = dataset.repeat()
train_gen = dataset.make_initializable_iterator()
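Before training, I like pulling a single batch to confirm the shapes; a quick sketch (the expected shapes just follow from the settings above):

x_batch, y_batch = train_gen.get_next()
with tf.Session() as sess:
    sess.run(train_gen.initializer)
    xs, ys = sess.run([x_batch, y_batch])
    print(xs.shape, ys.shape)  # expect (3000, 90, 1) and (3000, 10)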
You can inspect your model with summary() (that's where the table at the top came from)!!
# LSTM Training
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=(reshape_size, 1)))
model.add(Dropout(0.2))
model.add(LSTM(units=50, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(units=30, return_sequences=False))
model.add(Dense(units=40))
model.add(Dense(units=20))
model.add(Dense(units=10))   # 10 outputs = the next 10 days
model.compile(optimizer='adam', loss='mean_squared_error')
model.summary()
Here I specifically use a callback to save my weights as checkpoints.
The number 352983 below is simply the total number of examples in the dataset.
from tensorflow import keras

epochs = 20
checkpoint_path = './first_check'
with tf.Session() as sess:
    sess.run(train_gen.initializer)
    checkpoint_file = checkpoint_path + "/cp-{epoch:04d}.ckpt"
    # Save the weights after every epoch
    cp_callback = keras.callbacks.ModelCheckpoint(
        checkpoint_file, save_weights_only=True, verbose=1, period=1)
    # batch_size is omitted here: the tf.data pipeline already batches,
    # and fit() rejects batch_size when fed an iterator
    train = model.fit(train_gen, epochs=epochs,
                      steps_per_epoch=352983 // batch_size,
                      callbacks=[cp_callback])
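Since testing didn't happen today, here's just a sketch of how I'd expect to load those checkpoints back later (tf.train.latest_checkpoint picks up the newest cp-XXXX.ckpt in the folder):

# Rebuild the same architecture first, then restore the saved weights
latest = tf.train.latest_checkpoint(checkpoint_path)
model.load_weights(latest)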
Today I still didn't make it to the testing stage, because something came up... I ran out of time QQ
The main goal was to try out TFRecord, and it really is super hard~
Why do I insist on using TFRecord?
Honestly, I didn't really want to: it's tedious, and generating the data is slow...
Its one big advantage is that reading a TFRecord never requires loading the whole file into memory.
Question mark? What does that mean?
It means records are streamed from disk as the input pipeline asks for them, so you can train on datasets far bigger than your RAM.
So despite all those drawbacks, they all get blown away by that single advantage!!
And since I figured I'd probably need TFRecord down the road, I might as well step on the landmines now...
Hopefully after about 5 days in these pits I can get back to my stock prediction QQ
I still want to use the APIs on GCP!!!
Using TFRecords and tf.Example